Bootstrapping  and  Maintaining  Trust  in  the  Cloud  * 


Nabil  Schear 

MIT  Lincoln  Laboratory 

nabil@ll.mit.edu 


Patrick  T.  Cable 

MIT  Lincoln  Laboratory 

cable@ll.mit.edu 


Thomas  M.  Moyer 

MIT  Lincoln  Laboratory 

tmoyer@ll.mit.edu 


Bryan  Richard 
MIT  Lincoln  Laboratory 
bryan.richard@ll.mit.edu 


Robert  Rudd 
MIT  Lincoln  Laboratory 
robert.rudd@ll.mit.edu 


ABSTRACT 

Today’s infrastructure as a service (IaaS) cloud environments
rely upon full trust in the provider to secure applications
and data. Cloud providers do not offer the ability to create
hardware-rooted cryptographic identities for IaaS cloud
resources or sufficient information to verify the integrity of
systems. Trusted computing protocols and hardware like the TPM
have long promised a solution to this problem. However, these
technologies have not seen broad adoption because of their
complexity of implementation, low performance, and lack of
compatibility with virtualized environments. In this paper we
introduce a scalable trusted cloud key management system called
keylime. Our system provides an end-to-end solution for both
bootstrapping hardware-rooted cryptographic identities for IaaS
nodes and for system integrity monitoring of those nodes via
periodic attestation. We can support these functions in both
bare-metal and virtualized IaaS environments using a virtual
TPM. keylime provides a clean interface that allows higher-level
security services like disk/network encryption or configuration
management to leverage trusted computing without themselves
being trusted computing aware. We show that our bootstrapping
protocol adds about 2 seconds of latency to IaaS node
instantiation, that we can detect system integrity violations
in as little as 110ms, and that keylime can scale to thousands
of simultaneous cloud nodes.

1.  INTRODUCTION 

The  proliferation  and  popularity  of  infrastructure-as-a- 
service  (IaaS)  cloud  computing  services  such  as  Amazon 
Web  Services  and  Google  Compute  Engine  means  more  cloud 
tenants  are  hosting  sensitive,  private,  and  business  critical 
data  and  applications  in  the  cloud.  Unfortunately,  IaaS 
cloud  service  providers  do  not  currently  furnish  the  build¬ 
ing  blocks  necessary  to  establish  a  trusted  environment  for 
hosting  these  sensitive  resources.  Tenants  have  limited  abil¬ 
ity  to  verify  the  underlying  platform  when  they  deploy  to 
the  cloud  and  to  ensure  that  the  platform  remains  in  a  good 
state  for  the  duration  of  their  computation.  Additionally, 
current  practices  restrict  tenants’  ability  to  establish  unique, 
unforgeable  identities  for  individual  nodes  that  are  tied  to  a 
hardware  root  of  trust.  Often,  identity  is  based  solely  on  a 
software-based cryptographic solution or unverifiable trust
in  the  provider.  For  example,  tenants  often  pass  unprotected 
secrets  to  their  IaaS  nodes  via  the  cloud  provider. 

Commodity  trusted  hardware,  like  the  Trusted  Platform 
Module  (TPM)  [39] ,  has  long  been  proposed  as  the  solution 
for  bootstrapping  trust,  enabling  the  detection  of  changes  to 
system  state  that  might  indicate  compromise,  and  establish¬ 
ing  cryptographic  identities.  Unfortunately,  TPMs  have  not 
been  widely  deployed  in  IaaS  cloud  environments  due  to  a 
variety  of  challenges.  First,  the  TPM  and  related  standards 
for  its  use  are  complex  and  difficult  to  implement.  Second, 
since  the  TPM  is  a  cryptographic  co-processor  and  not  an 
accelerator,  it  can  introduce  substantial  performance  bottle¬ 
necks  (e.g.,  500+ms  to  generate  a  single  digital  signature). 
Lastly,  the  TPM  is  a  physical  device  by  design  and  most  IaaS 
services  rely  upon  virtualization  which  purposefully  divorces 
cloud  nodes  from  the  hardware  on  which  they  run.  At  best, 
the  limitation  to  physical  platforms  means  that  only  the 
cloud  provider  would  have  access  to  the  trusted  hardware, 
not  the  tenants  [17,  20,  31].  The  Xen  hypervisor  includes 
a  virtualized  TPM  implementation  that  links  its  security  to 
a  physical  TPM  [2,  10],  but  protocols  to  make  use  of  the 
vTPM  in  an  IaaS  environment  do  not  exist. 

To  address  these  challenges  we  identify  the  following  de¬ 
sirable  features  of  an  IaaS  trusted  computing  system: 

•  Secure  Bootstrapping  -  the  system  should  enable 
the  tenant  to  securely  install  an  initial  root  secret  into 
each  cloud  node.  This  is  typically  the  node’s  long  term 
cryptographic  identity  and  the  tenant  chains  other  se¬ 
crets  to  it  to  enable  secure  services. 


•  System  Integrity  Monitoring  -  the  system  should 
allow  the  tenant  to  monitor  cloud  nodes  as  they  oper¬ 
ate  and  react  to  integrity  deviations  within  one  second. 

•  Secure Layering (Virtualization Support) - the
system should support tenant controlled bootstrapping
and integrity monitoring in a VM using a TPM in the
provider’s infrastructure. This must be done in
collaboration with the provider in a least-privilege manner.

•  Compatibility  -  the  system  should  allow  the  ten¬ 
ant  to  leverage  hardware-rooted  cryptographic  keys  in 
software  to  secure  services  they  already  use  (e.g.,  disk 
encryption  or  configuration  management). 

•  Scalability  -  the  system  should  scale  to  support  boot¬ 
strapping  and  monitoring  of  thousands  of  IaaS  resources 
as  they  are  elastically  instantiated  and  terminated. 

Prior  cloud  trusted  computing  solutions  address  a  subset 
of  these  features,  but  none  achieve  all.  Excalibur  [31]  sup¬ 
ports  bootstrapping  at  scale,  but  does  not  allow  for  system 
integrity  monitoring  or  offer  full  support  for  tenant  trusted 
computing  inside  a  VM  (i.e.,  layering).  Manferdelli  et  al. 
created  a  system  that  supports  secure  layering  and  boot¬ 
strapping,  but  does  not  support  system  integrity  monitor¬ 
ing,  is  incompatible  with  existing  cryptographic  services, 
and  has  not  demonstrated  cloud  scale  operation  [25].  Fi¬ 
nally,  the  Cloud  Verifier  [34]  enables  system  integrity  mea¬ 
surement  and  cloud  scalability  but  does  not  fully  address 
secure  layering  or  enable  secure  bootstrapping. 

In this paper, we introduce keylime, an end-to-end IaaS
trusted  cloud  key  management  service  that  supports  all  the 
above  desired  features.  The  key  insight  of  our  work  is  to 
utilize  trusted  computing  to  bootstrap  identity  in  the  cloud 
and  provide  integrity  measurement  to  support  revocation, 
but  then  allow  high-level  services  that  leverage  these  iden¬ 
tities  to  operate  independently.  Thus,  we  provide  a  clean 
and  easy  to  use  interface  that  can  integrate  with  existing 
security  technologies  (see  Figure  1). 

We  introduce  a  novel  bootstrap  key  derivation  protocol 
that  combines  both  tenant  intent  and  integrity  measurement 
to  install  secrets  into  cloud  nodes.  We  then  leverage  the 
Cloud  Verifier  [34]  pattern  of  Schiffman  et  al.  to  enable  pe¬ 
riodic  attestation  that  automatically  links  to  identity  revo¬ 
cation.  keylime  supports  the  above  with  secure  layering  in 
both  bare-metal  and  virtualized  IaaS  resources  in  a  manner 
that  minimizes  trust  in  the  cloud  provider.  We  demonstrate 
the compatibility of keylime by securely enabling cloud
provisioning with cloud-init¹, encrypted communication with
IPsec, configuration management with Puppet², secret
management with Vault³, and storage with LUKS/dm-crypt
encrypted disks. Unlike existing solutions [38, 25], these
services don’t need to be trusted computing aware; they just
need to use an identity key and respond to key revocations.

Finally,  we  show  that  keylime  can  scale  to  handle  thou¬ 
sands  of  simultaneous  nodes  and  perform  integrity  checks 
on  nodes  at  rates  up  to  2,500  integrity  reports  (quotes)  ver¬ 
ified  per  second.  We  present  and  evaluate  multiple  options 
for  deploying  our  integrity  measurement  verifier  both  in  the 
cloud, in a low-cost cloud appliance based on a Raspberry
Pi, and on-premises. We show that secure provisioning
imposes approximately 2s of overhead on creating new VMs.
Finally, we find that our system can detect integrity
measurement violations in as little as 110ms.

¹ http://launchpad.net/cloud-init

² http://puppetlabs.com/

³ http://hashicorp.com/blog/vault.html

Figure 1: Interface between trusted hardware and
existing software-based security services via the
keylime trusted computing service layer.

2.  BACKGROUND 

Trusted Computing. The TPM provides the means for
creating trusted systems that are amenable to system
integrity monitoring. The TPM, as specified by the Trusted
Computing Group (TCG)⁴, is a cryptographic co-processor
that  provides  key  generation,  protected  storage,  and  crypto¬ 
graphic  operations.  The  protected  storage  includes  a  set  of 
Platform  Configuration  Registers  (PCRs)  where  the  TPM 
stores  hashes.  The  TPM  uses  these  registers  to  store  mea¬ 
surements  of  integrity-relevant  components  in  the  system. 

To  store  a  new  measurement  in  a  PCR,  the  extend  op¬ 
eration  concatenates  the  existing  PCR  value  with  the  new 
measurement,  SHA-1  hashes  that  value,  and  stores  the  re¬ 
sulting  hash  in  the  register.  This  hash  chain  allows  a  veri¬ 
fier  to  confirm  that  a  set  of  measurements  reported  by  the 
system  has  not  been  altered.  This  report  of  measurements 
is called an attestation, and relies on the quote operation,
which accepts a nonce⁵ and a set of PCRs. This attestation
can include measurements of the BIOS, firmware, boot
loader,  hypervisor,  OS,  and  applications,  depending  on  the 
configuration  of  the  system.  The  TPM  reads  the  PCR  val¬ 
ues,  and  then  signs  the  nonce  and  PCRs  with  a  key  that  is 
only  accessible  by  the  TPM.  The  key  the  TPM  uses  to  sign 
quotes  is  called  an  attestation  identity  key  (AIK). 
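
The extend operation and the hash chain it produces can be illustrated with a short Python sketch. This is a minimal illustration only, assuming 20-byte SHA-1 PCRs that start at all zeros; the function names `extend` and `replay` and the sample measurements are hypothetical and are not part of keylime or any TPM library.

```python
import hashlib

PCR_SIZE = 20  # SHA-1 digest length in bytes (assumed PCR width)


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """Return SHA-1(old PCR value || measurement), the new PCR value."""
    return hashlib.sha1(pcr + measurement).digest()


def replay(measurements: list[bytes]) -> bytes:
    """Recompute the PCR value a verifier expects from a reported measurement list."""
    pcr = bytes(PCR_SIZE)  # assumed reset value: all zeros
    for m in measurements:
        pcr = extend(pcr, m)
    return pcr


# Hypothetical measurements: SHA-1 digests of the components that were loaded.
log = [hashlib.sha1(c).digest() for c in (b"bootloader", b"kernel", b"initrd")]
expected = replay(log)

# Changing or reordering any entry yields a different final PCR value.
tampered = replay([log[0], hashlib.sha1(b"evil kernel").digest(), log[2]])
assert tampered != expected
assert replay(list(reversed(log))) != expected
```

A verifier that receives a measurement list and a signed quote can replay the list in this way and compare the result with the quoted PCR value; a mismatch indicates that the reported list was altered.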

The  TPM  contains  a  key  hierarchy  for  securely  storing 
cryptographic  keys.  The  root  of  this  hierarchy  is  the  Storage 
Root  Key  (SRK)  which  the  owner  generates  during  TPM  ini¬ 
tialization.  The  SRK  in  turn  protects  the  TPM  AIK(s)  when 
they  are  stored  outside  of  the  TPM’s  nonvolatile  storage 
(NVRAM).  Each  TPM  also  contains  a  permanent  credential 
called  the  Endorsement  Key  (EK).  The  TPM  manufacturer 
generates  and  signs  the  EK.  The  EK  uniquely  identifies  each 
TPM  and  certifies  that  it  is  a  valid  TPM  hardware  device. 
The  private  EK  never  leaves  the  TPM,  is  never  erased,  and 
can  only  be  used  for  encryption  and  decryption  during  AIK 
initialization  to  limit  its  exposure. 

⁴ http://trustedcomputinggroup.org

⁵ The quote takes in a field called externalData which can
be any arbitrary value that should be signed along with the
PCR values in the quote. While this is typically a nonce to
demonstrate freshness, the externalData can also be hashed
with other data that should be bound to the quote.

Integrity Measurement. To measure a system component,
the underlying component must be trusted computing-aware.
The  BIOS  in  systems  with  a  TPM  supports  measurement  of 
firmware  and  boot  loaders.  TPM-aware  boot  loaders  can 
measure  hypervisors  and  operating  systems  [22,  19,  29].  To 
measure  applications,  the  operating  system  must  support 
measurement  of  applications  that  are  launched,  such  as  the 
Linux Integrity Measurement Architecture (IMA) [30, 21]. One
limitation  of  approaches  like  IMA  is  the  inability  to  mon¬ 
itor  the  run-time  state  of  the  applications.  Nexus  aims  to 
address  this  limitation  with  a  new  OS  that  makes  trusted 
computing  a  first-class  citizen,  and  supports  introspection 
to  validate  run-time  state  [36].  Several  proposals  exist  for 
providing  run-time  integrity  monitoring  including  LKIM  [24] 
and  DynIMA  [8] .  These  systems  ensure  that  a  running  sys¬ 
tem  is  in  a  known  state,  allowing  a  verifier  to  validate  not 
only  that  what  was  loaded  was  known,  but  that  it  has  not 
been  tampered  with  while  it  was  running. 

In  addition  to  operating  system  validation,  others  have 
leveraged  trusted  computing  and  integrity  measurement  to 
support  higher-level  services,  such  as  protected  access  to 
data  when  the  client  is  offline  [23],  or  to  enforce  access 
policies  on  data  [26].  Others  have  proposed  mechanisms 
to  protect  the  server  from  malicious  clients,  e.g.,  in  online 
gaming  [1],  or  applications  from  a  malicious  operating  sys¬ 
tem  [6,  7,  15].  However,  these  proposals  do  not  account  for 
the  challenges  of  migrating  applications  to  a  cloud  environ¬ 
ment,  and  often  assume  existing  infrastructure  to  support 
trusted  computing  key  management. 

IaaS  Cloud  Services  In  the  IaaS  cloud  service  model,  users 
request  an  individual  compute  resource  to  execute  their  ap¬ 
plication.  For  example,  users  can  provision  physical  hard¬ 
ware,  virtual  machines,  or  containers.  In  this  paper,  we 
refer  to  any  of  these  tenant-provisioned  IaaS  resources  as 
cloud  nodes.  Users  provision  nodes  either  by  uploading  a 
whole  image  to  the  provider  or  by  configuring  a  pared-down 
base  image  that  the  provider  makes  available.  Users  often 
begin  by  customizing  a  provider-supplied  image,  then  create 
their own images (using a tool like Packer⁶) to decrease the
amount  of  time  it  takes  for  a  node  to  become  ready. 

cloud-init is a standard mechanism across cloud providers
(e.g., Amazon EC2 and Microsoft Azure) that allows cloud
tenants  to  specify  bootstrapping  information.  It  accepts 
a  YAML-formatted  description  of  what  bootstrapping  ac¬ 
tions  should  be  taken  and  supports  plugins  to  take  those 
actions.  Examples  of  such  actions  include:  adding  users, 
adding  package  repositories,  or  running  arbitrary  scripts. 
Users  of  cloud  computing  resources  at  scale  typically  spawn 
new  cloud  instances  using  an  application  programming  in¬ 
terface  and  pass  along  enough  bootstrapping  information 
to  allow  the  instance  to  communicate  with  a  configuration 
management platform (such as Chef⁷ or Puppet) for
further  application/instance-specific  configuration.  These 
bootstrapping  instructions  are  not  encrypted,  meaning  that 
a provider could intercept secrets passed via the
bootstrapping instructions. In our research, we found that
organizations will either (a) send an unprotected pre-shared
key for Puppet in their cloud-init bootstrapping actions, or
(b) rely on some weaker method of proving identity such as
trusting the certificate’s common name (hostname).


⁶ https://www.packer.io/

⁷ https://www.chef.io/chef/

Figure 2: Physical node registration protocol.


3.  DESIGN 

To  address  the  limitations  of  current  approaches,  we  con¬ 
sider  the  union  of  trusted  computing  and  IaaS  to  provide 
a  hardware  root-of-trust  that  tenants  leverage  to  establish 
trust  in  the  cloud  provider’s  infrastructure  and  in  their  own 
systems running on that infrastructure. This section considers
the threats that keylime addresses, and how to leverage
existing  trusted  computing  constructs  in  a  virtualized  envi¬ 
ronment  while  limiting  complexity  and  overhead. 

3.1  Threat  Model 

Our  goal  is  to  minimize  trust  in  the  cloud  provider  and 
carefully  account  for  all  concessions  we  must  make  to  enable 
trusted  computing  services.  We  assume  the  cloud  provider  is 
semitrusted,  i.e.,  they  are  organizationally  trustworthy  but 
are  still  susceptible  to  compromise  or  malicious  insiders.  We 
assume  the  cloud  provider  has  processes,  technical  controls, 
and  policy  in  place  to  limit  the  impact  of  such  compromise 
from  spreading  across  their  entire  infrastructure.  Thus,  in 
the  semitrusted  model,  we  assume  that  some  fraction  of  the 
cloud  provider’s  resources  may  be  under  the  control  of  the 
adversary  (e.g.,  a  rogue  system  administrator  may  control  a 
subset  of  racks  in  an  IaaS  region). 

Specifically,  we  assume  that  the  adversary  can  monitor 
or  manipulate  compromised  portions  of  the  cloud  network 
or  storage  arbitrarily.  We  assume  that  the  adversary  may 
not  physically  tamper  with  any  hypervisor  host’s  CPU,  bus, 
memory, or TPM⁸. We further assume that the provider
does  not  deploy  a  hypervisor  with  the  capability  to  spy 
on  tenant  VM  memory  (e.g.,  Ether  [9]).  We  assume  that 
TPM  and  system  manufacturers  have  created  the  appropri¬ 
ate  endorsement  credentials  and  have  some  mechanism  to 
test their validity (i.e., signed certificates).

Finally,  we  assume  that  the  attacker’s  goal  is  to  obtain 
persistent  access  to  a  tenant  system  in  order  to  steal,  dis¬ 
rupt,  or  deny  the  tenant’s  data  and  services.  To  accomplish 
persistence  the  attacker  must  modify  the  code  loading  or 
running  process.  We  assume  that  such  modifications  would 
be  detected  by  load-time  integrity  measurement  of  the  hy¬ 
pervisor  or  kernel  [19],  runtime  integrity  measurement  of  the 
kernel [24], and integrity measurement of applications [30].

3.2  Architecture 

To introduce the architecture of keylime we first describe
a simplified architecture for managing trusted computing
services for a single organization, or cloud tenant, without
virtualization. We then expand this simplified architecture
into the full keylime architecture, providing extensions that
allow layering of provider and tenant trusted computing
services and supporting multiple varieties of IaaS execution
isolation (i.e., bare metal, virtual machines, or containers).
Figure 3 depicts the full system architecture with layering.

⁸ This is similar to the threat model assumed by the TPM,
where physical protections are not a strict requirement to
be compliant with the specification.

The  first  step  in  bootstrapping  the  architecture  is  to  create 
a  tenant-specific  registrar.  The  registrar  stores  and  certifies 
the  public  AIKs  of  the  TPMs  in  the  tenant’s  infrastructure. 
In  the  simplified  architecture,  the  tenant  registrar  can  be 
hosted  outside  the  cloud  in  the  tenant’s  own  infrastructure 
or  could  be  hosted  on  a  physical  system  in  the  cloud.  The 
registrar  is  only  a  trust  root  and  does  not  store  any  tenant 
secrets,  so  the  tenant  can  decide  to  trust  the  registrar  only 
after  it  attests  its  system  integrity. 

To  create  a  registrar,  we  can  leverage  existing  standards 
for  the  creation  and  validation  of  AIKs  and  credentials  by 
creating a TCG Privacy CA [37]. To avoid the overhead of
managing a full PKI, we also created a registrar
that simply stores valid TPM AIK public keys indexed by
node UUID. Clients request public AIKs from the registrar
through a server-authenticated TLS channel.

To validate the AIKs in the registrar, we developed a
variant of the TCG privacy CA protocol that has the same
security guarantees (see Figure 2). The protocol begins
with the node passing its ID and standard TPM credentials
(EK_pub, AIK_pub) to the registrar. The registrar then checks
the validity of the TPM EK with the TPM manufacturer.
If valid, the registrar creates an ephemeral key K_e and
encrypts it with the TPM EK_pub. Importantly, the generation
and installation of the EK by the TPM manufacturer roots
the trust upon which the rest of our system relies. Since we
are not managing certificates or concerned with the privacy
of AIKs, we use an HMAC to prove that the node is able
to decrypt the ephemeral key K_e, encrypted with the EK,
using the TPM ActivateIdentity command. The registrar
then marks that AIK as being valid so that it can be used
to validate quotes.
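
The challenge-response at the heart of this registration step can be sketched as follows. This is a toy model under stated assumptions, not the keylime implementation: the `ToyTPM` class, its XOR-based wrapping (a placeholder for real public-key encryption under the EK), the choice of HMAC-SHA256, and the use of the node ID as the HMAC message are all hypothetical.

```python
import hashlib
import hmac
import secrets


class ToyTPM:
    """Toy stand-in for a TPM. Real hardware would use RSA with the EK."""

    def __init__(self) -> None:
        self._ek_secret = secrets.token_bytes(32)  # models the private EK

    def _pad(self) -> bytes:
        return hashlib.sha256(self._ek_secret).digest()

    def encrypt_to_ek(self, plaintext: bytes) -> bytes:
        # Placeholder for encryption under the public EK (NOT real cryptography).
        return bytes(a ^ b for a, b in zip(plaintext, self._pad()))

    def activate_identity(self, blob: bytes) -> bytes:
        # Placeholder for TPM ActivateIdentity: only this TPM can unwrap the blob.
        return bytes(a ^ b for a, b in zip(blob, self._pad()))


def registrar_challenge(tpm: ToyTPM) -> tuple[bytes, bytes]:
    """Registrar side: create the ephemeral key K_e and wrap it for the node's EK."""
    k_e = secrets.token_bytes(32)
    return k_e, tpm.encrypt_to_ek(k_e)


def node_response(tpm: ToyTPM, blob: bytes, node_id: str) -> bytes:
    """Node side: recover K_e inside the TPM and prove possession with an HMAC."""
    k_e = tpm.activate_identity(blob)
    return hmac.new(k_e, node_id.encode(), hashlib.sha256).digest()


def registrar_verify(k_e: bytes, node_id: str, proof: bytes) -> bool:
    """Registrar side: accept the AIK only if the HMAC matches."""
    expected = hmac.new(k_e, node_id.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, proof)


node_tpm = ToyTPM()
k_e, blob = registrar_challenge(node_tpm)
proof = node_response(node_tpm, blob, "node-1")
assert registrar_verify(k_e, "node-1", proof)

# A node without the matching EK cannot recover K_e, so its proof is rejected.
impostor = node_response(ToyTPM(), blob, "node-1")
assert not registrar_verify(k_e, "node-1", impostor)
```

The point of the sketch is only the shape of the exchange: the registrar never sees a private key, and a valid HMAC shows that the node's TPM was able to recover K_e.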

The core component of keylime is an out-of-band cloud
verifier  (CV)  similar  to  the  one  described  by  Schiffman  et 
al.  [34].  Each  cloud  organization  will  have  at  least  one  CV 
that  is  responsible  for  verifying  the  system  state  of  the  orga¬ 
nization’s  IaaS  resources.  The  tenant  can  host  the  CV  in  the 
IaaS  cloud  or  on-premises  at  their  own  site  (we  give  options 
for  tenant  registrar  and  CV  deployment  in  Section  3.2.1). 
The  CV  relies  upon  the  tenant  registrar  for  validating  that 
the  AIKs  used  to  sign  TPM  quotes  are  valid.  The  regis¬ 
trar,  CV,  and  the  associated  cloud  node  service  are  the  only 
components in keylime that manage and use keys and public
key infrastructures associated with the TPM.

The CV participates in a three party key derivation
protocol (described in detail in Section 3.2.2) where the CV
and tenant cooperate to derive a key, K_b, at the cloud node
to support initial storage decryption. The tenant uses K_b to
protect  tenant  secrets  and  trust  relationships.  The  tenant 
can  use  this  key  to  unlock  either  its  disk  image  or  to  unlock 
tenant-specific  configuration  provided  by  cloud-init. 

This  protocol  is  akin  to  the  method  by  which  a  user  can 
use  the  TPM  to  decrypt  his  or  her  disk  in  a  laptop.  To  allow 
the  decryption  key  to  be  used  to  boot  the  laptop,  the  user 
must  enter  a  password  (demonstrating  the  user’s  intent)  and 
TPM  PCRs  must  match  a  set  of  whitelisted  integrity  mea¬ 
surements  (demonstrating  the  validity  of  the  system  that 
will  receive  the  encryption  key).  In  an  IaaS  cloud  environ¬ 
ment,  there  is  neither  a  trusted  console  where  a  user  can 
enter a password nor is there a way to pre-seed the TPM
with the storage key or measurement whitelist. Our protocol
uses secret sharing to solve these problems by relying
externally upon the CV for integrity measurement and
by  having  the  tenant  directly  interact  with  the  cloud  node 
to  demonstrate  intent  to  derive  a  key.  The  protocol  then 
extends  beyond  bootstrapping  to  enable  continuous  system 
integrity  monitoring.  The  CV  periodically  polls  each  cloud 
node’s  integrity  state  to  determine  if  any  runtime  policies 
have  been  violated.  The  frequency  with  which  the  CV  re¬ 
quests  and  verifies  each  node’s  integrity  state  will  define  the 
latency  between  an  integrity  violation  and  detection. 
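
The secret-sharing idea can be illustrated with a two-share XOR split. This is a sketch under an assumption: the text above does not specify the sharing scheme, so the XOR construction, the share names `u` and `v`, and the assignment of shares to the tenant and the CV are hypothetical choices made for illustration.

```python
import secrets


def split_key(k_b: bytes) -> tuple[bytes, bytes]:
    """Split k_b into two shares; either share alone reveals nothing about k_b."""
    v = secrets.token_bytes(len(k_b))         # random share (e.g., held by the CV)
    u = bytes(a ^ b for a, b in zip(k_b, v))  # complementary share (e.g., sent by the tenant)
    return u, v


def combine(u: bytes, v: bytes) -> bytes:
    """Recombine both shares at the cloud node to recover k_b."""
    return bytes(a ^ b for a, b in zip(u, v))


k_b = secrets.token_bytes(32)  # bootstrap key that protects tenant secrets
u, v = split_key(k_b)

# The node can derive k_b only once it holds both shares: one delivered
# directly by the tenant (intent), one released after integrity verification.
assert combine(u, v) == k_b
```

Under this assumption, withholding either share is enough to prevent key derivation, which is how tenant intent and integrity state can each act as an independent gate.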

To  cleanly  link  trust  and  integrity  measurement  rooted 
in the TPM to higher-level services, we create a parallel
software-only  PKI  and  a  simple  service  to  manage  it.  We 
refer  to  this  service  as  the  software  CA.  To  bootstrap  this 
service,  we  use  the  key  derivation  bootstrap  protocol  to  cre¬ 
ate  a  cloud  node  to  host  the  service.  Since  the  bootstrap 
key  derivation  protocol  ensures  that  the  node  can  only  de¬ 
rive  a  key  if  the  tenant  authorizes  it  and  if  the  node’s  in¬ 
tegrity  state  is  approved,  we  can  encrypt  the  private  key  for 
the  software  CA  and  pass  it  to  the  node  upon  provisioning. 
Once  established,  we  can  then  start  other  cloud  nodes  and 
securely  pass  them  keys  signed  by  this  CA.  The  linkage  to 
the  hardware  root  of  trust,  the  secure  bootstrapping  of  rele¬ 
vant  keys,  and  user  intent  to  create  new  resources  are  again 
ensured  using  the  bootstrap  key  derivation  protocol.  Once 
established,  standard  tools  and  services  like  IPsec  or  Pup¬ 
pet  can  now  directly  use  the  software  CA  identity  credentials 
that  each  node  now  possesses. 

To  complete  the  linkage  between  the  trusted  computing 
services  and  the  software  identity  keys,  we  need  a  mechanism 
to  revoke  keys  in  the  software  PKI  when  integrity  violations 
occur  in  the  trusted  computing  layer.  The  CV  is  responsible 
for  notifying  the  software  CA  of  these  violations.  The  CV 
includes  metadata  about  the  nature  of  the  integrity  viola¬ 
tion,  which  allows  the  software  CA  to  have  a  response  policy. 
The  software  CA  supports  standardized  methods  for  certifi¬ 
cate  revocation  like  signed  revocation  lists  or  by  hosting  an 
OCSP  responder.  To  support  push  notifications  of  failures, 
the  software  CA  can  also  publish  signed  notifications  to  a 
message  bus.  This  way  services  that  directly  support  revo¬ 
cation  actions  can  subscribe  to  notifications  (e.g.,  to  trigger 
a re-key in a secret manager like Vault).

3.2.1  Layering  Trust 

We  next  expand  this  architecture  to  work  across  the  layers 
of  virtualization  common  in  today’s  IaaS  environments.  Our 
goal  is  to  create  the  architecture  described  previously  that 
cleanly  links  common  security  services  to  a  trusted  comput¬ 
ing  layer  in  a  cloud  tenant’s  environment.  Thus,  in  a  VM 
hosting  environment  like  Amazon  AWS  or  OpenStack,  we 
aim  to  create  trusted  computing  enabled  software  CAs  and 
tenant  nodes  inside  of  virtual  machine  instances.  Note  that, 
in  a  bare-metal  provisioning  environment  like  IBM  Softlayer, 
HaaS  [13],  or  OpenStack  Ironic,  we  can  directly  utilize  the 
simplified  architecture  where  there  is  no  trust  layering. 

We  observe  that  IaaS-based  virtual  machines  or  physical 
hosts  all  provide  a  common  abstraction  of  isolated  execu¬ 
tion.  Each  form  of  isolated  execution  in  turn  needs  a  root  of 
trust  on  which  to  build  trusted  computing  services.  Due  to 
the  performance  and  resource  limitations  of  typical  TPMs 
(e.g.,  taking  500  or  more  milliseconds  to  generate  a  quote, 
and only supporting a fixed number of PCRs), direct
multiplexed use of the physical TPM will not scale to the numbers
of virtual machines that can be hosted on a single modern
system. As described by Berger et al. [2] and as implemented
in Xen [10], we utilize a virtualized implementation of the
TPM. Each VM has its own software TPM (called a vTPM)
whose trust is in turn rooted in the hardware TPM of the
hosting system. The vTPM is isolated from the guest that
uses it by running in a separate Xen domain.

Figure 3: Layered keylime trusted computing architecture.

The vTPM interface is the same as that of a hardware TPM.
The only exception is that the client can request a
deep quote that will get a quote from the hardware TPM in
addition to getting a quote from the vTPM. These quotes
are linked together by including a hash of the vTPM quote
and nonce in the hardware TPM quote. Deep quotes suffer
from the slow performance of hardware TPMs, but as we
show later in this section, we can limit the use of deep quotes
while still maintaining reasonable performance, scale, and
security guarantees.
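
The linkage between the two quotes in a deep quote can be sketched as below. This is an illustration under assumptions: the `Quote` class is an unsigned stand-in (a real quote carries an AIK signature), the byte serialization is invented for the example, and SHA-1 is assumed as the binding hash; none of these names come from keylime or the Xen vTPM code.

```python
import hashlib
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Unsigned stand-in for a TPM quote; a real quote is signed by an AIK."""
    external_data: bytes
    pcrs: tuple[bytes, ...]

    def to_bytes(self) -> bytes:
        # Hypothetical serialization used only for hashing in this sketch.
        return self.external_data + b"".join(self.pcrs)


def deep_quote(nonce: bytes, vtpm_pcrs: tuple[bytes, ...],
               hw_pcrs: tuple[bytes, ...]) -> tuple[Quote, Quote]:
    """Produce a vTPM quote and a hardware quote bound to it."""
    vquote = Quote(nonce, vtpm_pcrs)
    binding = hashlib.sha1(vquote.to_bytes() + nonce).digest()
    hwquote = Quote(binding, hw_pcrs)  # binding travels as externalData
    return vquote, hwquote


def is_linked(vquote: Quote, hwquote: Quote, nonce: bytes) -> bool:
    """Verifier check: the hardware quote must commit to this vTPM quote and nonce."""
    expected = hashlib.sha1(vquote.to_bytes() + nonce).digest()
    return hwquote.external_data == expected


nonce = secrets.token_bytes(20)
vq, hq = deep_quote(nonce, (bytes(20),), (bytes(20),))
assert is_linked(vq, hq, nonce)

# A vTPM quote with different PCR values does not match the hardware quote.
forged = Quote(nonce, (b"\x01" * 20,))
assert not is_linked(forged, hq, nonce)
```

In a real deployment the verifier would also check both signatures against the AIKs held by the tenant and provider registrars; the sketch covers only the hash binding between the two quotes.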

To  assure  a  chain  of  trust  that  is  rooted  in  hardware,  we 
need  the  IaaS  provider  to  replicate  some  of  the  trusted  com¬ 
puting  service  infrastructure  in  their  environment  and  allow 
the  tenant  trusted  computing  services  to  query  it.  Specifi¬ 
cally,  the  provider  must  establish  a  registrar  for  their  infras¬ 
tructure,  must  publish  an  up-to-date  signed  list  of  the  in¬ 
tegrity  measurements  of  their  infrastructure,  and  may  even 
have  their  own  CV.  The  tenant  CV  will  interact  with  the 
whitelist  authority  service  and  the  provider’s  registrar  to 
verify  deep  quotes  collected  by  the  infrastructure. 

Despite  the  fact  that  most  major  IaaS  providers  run  closed- 
source  hypervisors  and  would  provide  what  amounts  to  opaque 
integrity  measurements  [16],  we  find  there  is  still  value  in 
verifying  the  integrity  of  the  provider’s  services.  By  pro¬ 
viding  a  known-good  list  of  integrity  measurements,  the 
provider  is  committing  to  a  version  of  the  hypervisor  that 
will be deployed widely across their infrastructure. This
prevents a targeted attack where a single hypervisor is
replaced with a malicious version designed to spy on the tenant
(e.g.,  the  provider  is  coerced  by  a  government  to  monitor  a 
specific  cloud  tenant).  Thus,  an  attacker  must  subvert  both 
the  code  loading  process  on  all  the  hypervisors  and  the  pub¬ 
lishing  and  signing  process  for  known-good  measurements. 
In our semitrusted threat model, we assume the provider has
controls and monitoring which limit the ability of a rogue
individual to accomplish this.

Figure 4: Virtual node registration protocol.

As  in  Section  3.2,  we  begin  with  the  establishment  of  a 
tenant  registrar  and  cloud  verifier.  There  are  multiple  op¬ 
tions  for  hosting  these  services  securely:  1)  in  a  bare  metal 
IaaS  instance  with  TPM,  2)  on-tenant-premises  in  tenant- 
owned  hardware,  3)  in  a  small  trusted  hardware  appliance 
deployed  to  the  IaaS  cloud  provider,  and  4)  in  an  IaaS  vir¬ 
tual  machine.  The  first  three  of  these  options  rely  upon 
the  architecture  and  protocols  we’ve  already  discussed.  The 
last  option  requires  the  tenant  to  establish  an  on-tenant- 
premises  CV  and  use  that  to  bootstrap  the  tenant  registrar 
and  CV.  This  on-tenant-premises  CV  identifies  and  checks 
the  integrity  of  the  tenant’s  virtualized  registrar  and  CV, 
which in turn are responsible for the rest of the tenant’s
virtualized  infrastructure. 

The  primary  motivations  for  a  tenant  choosing  between 
these  options  are  the  detection  latency  for  integrity  viola¬ 
tions,  scale  of  IaaS  instances  in  their  environment,  band¬ 
width  between  the  tenant  and  the  IaaS  cloud,  and  cost.  Op¬ 
tion  1  provides  maximum  performance  but  at  higher  cost. 
Option 2 will be limited by bandwidth and incurs additional
cost to maintain resources outside of the cloud. Option 3 is
a  good  trade-off  between  cost  and  performance  for  a  small 
cloud  tenant  with  only  tens  of  nodes  or  who  can  tolerate  a 
longer  detection  latency.  Finally,  Option  4  provides  compat¬ 
ibility  with  current  cloud  operations,  good  performance  and 
scalability,  and  low  cost  at  the  expense  of  increased  complex¬ 
ity.  In  Section  5,  we  examine  the  performance  trade-offs  of 
these  options  including  a  low-cost  registrar  and  CV  appli¬ 
ance  (Option  3)  we  implemented  on  a  Raspberry  Pi. 


Once  we  have  created  the  tenant  registrar  and  CV,  we  can 
begin  securely  bootstrapping  nodes  into  the  environment. 
As  before,  the  first  node  to  establish  is  a  virtualized  software 
CA  and  we  do  this  by  creating  a  private  signing  key  offline 
and  protecting  it  with  a  key  that  will  be  derived  by  the 
bootstrap  key  derivation  protocol.  The  following  process 
will  be  the  same  for  all  tenant  cloud  nodes.  When  a  node 
boots,  it  will  get  a  vTPM  from  the  IaaS  provider. 

We can enroll the vTPM into the tenant registrar by generating
the appropriate vTPM credentials (e.g., EK and AIK)
and by requesting a deep quote bound to those credentials
from the hardware TPM on the host on which it runs
(see Figure 4). We denote a quote or deep quote using
$Quote_{AIK}(externalData, [PCRs])$ for a quote using AIK
from the TPM with the associated externalData and optional
PCR values included in the signed data. Upon receiving
the deep quote, the tenant registrar then asks the
provider registrar if the AIK from the deep quote is valid.
Finally,  the  tenant  registrar  will  request  the  latest  valid  in¬ 
tegrity  measurements  from  the  provider  and  check  the  PCRs 
in  the  deep  quote.  Once  enrolled,  the  node  can  use  the 
bootstrap  key  derivation  protocol  to  unlock  its  private  CA 
signing  key  and  begin  operation. 

When  considering  the  cost  of  performing  a  deep  quote, 
the  provider  must  carefully  consider  the  additional  latency 
of  the  physical  TPM.  Deep  quotes  provide  a  link  between 
the  vTPM  and  the  physical  TPM  of  the  machine,  and  new 
enrollments  should  always  include  deep  quotes.  When  con¬ 
sidering  if  deep  quotes  should  be  used  as  part  of  periodic 
attestation,  we  must  understand  what  trusted  computing 
infrastructure  the  provider  has  deployed.  If  the  provider  is 
doing  load  time  integrity  only  (e.g.,  secure  boot),  then  deep 
quotes  will  only  reflect  the  one-time  binding  at  boot  between 
the  vTPM  and  the  physical  TPM  and  the  security  of  the 
vTPM  infrastructure.  If  the  provider  has  runtime  integrity 
checking  of  their  infrastructure,  there  is  value  in  the  ten¬ 
ant  performing  periodic  attestation  using  deep  quotes.  In 
the  optimal  deployment  scenario,  the  provider  can  deploy 
keylime and provide tenants with access to the integrity
state  of  the  hypervisors  that  host  tenant  nodes.  To  limit 
the  impact  of  slow  hardware  TPM  operations,  the  provider 
can  utilize  techniques  like  batch  attestation  where  multiple 
deep  quote  requests  from  different  vTPMs  can  be  combined 
into  a  single  hardware  TPM  operation  [28,  31]. 

3.2.2  Key  Derivation  Protocol 

We  now  introduce  the  details  of  our  bootstrap  key  deriva¬ 
tion  protocol  (Figure  5).  The  goal  of  this  protocol  is  for  the 
cloud  tenant  to  obtain  key  agreement  with  a  cloud  node  they 
have  provisioned  in  an  IaaS  system.  The  protocol  relies  upon 
the  CV  to  provide  integrity  measurement  of  the  cloud  node 
during  the  protocol.  The  tenant  also  directly  interacts  with 
the  cloud  node  to  demonstrate  their  intent  to  spawn  that  re¬ 
source  and  allow  it  to  decrypt  sensitive  contents.  However, 
the  tenant  does  not  directly  perform  integrity  measurement. 
This  separation  of  duties  is  beneficial  because  the  attesta¬ 
tion  protocols  may  operate  in  parallel  and  it  simplifies  de¬ 
ployment  by  centralizing  all  integrity  measurement,  white 
lists,  and  policy  in  the  CV. 

To begin the process, the tenant generates a fresh random
symmetric encryption key $K_b$. The cloud tenant uses AES-GCM
to encrypt the sensitive data $d$ to pass to the node with
$K_b$, denoted $Enc_{K_b}(d)$. The tenant then performs trivial
secret sharing to split $K_b$ into two parts: $U$, which the tenant
will retain and pass directly to the cloud node, and $V$, which
the tenant will share with the CV to provide to the node
upon successful verification of the node’s integrity state. To
obtain these shares the tenant generates a secure random
value $V$ the same length as $K_b$ and computes $U = K_b \oplus V$.
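The XOR-based split described above can be sketched in a few lines of Python. This is a minimal illustration and not the keylime source: the function names split_key and combine_shares are hypothetical, and the 256-bit key length is taken from the AES key size mentioned in Section 4.

```python
import secrets
from typing import Tuple


def split_key(kb: bytes) -> Tuple[bytes, bytes]:
    """Split kb into two XOR shares (U, V); both are needed to recover kb."""
    v = secrets.token_bytes(len(kb))            # random share, same length as kb
    u = bytes(a ^ b for a, b in zip(kb, v))     # U = Kb XOR V
    return u, v


def combine_shares(u: bytes, v: bytes) -> bytes:
    """Recover kb = U XOR V."""
    if len(u) != len(v):
        raise ValueError("shares must have equal length")
    return bytes(a ^ b for a, b in zip(u, v))


kb = secrets.token_bytes(32)    # 256-bit bootstrap key
u, v = split_key(kb)
assert combine_shares(u, v) == kb
```

Because V is uniformly random, either share on its own reveals nothing about the key, which is the property the protocol relies on when it hands V to the CV and U to the node.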

In the next phase of the protocol, the tenant requests the
IaaS provider to instantiate a new resource (i.e., a new virtual
machine). The tenant sends $Enc_{K_b}(d)$ to the provider
as part of the resource creation. The data $d$ may be configuration
meta-data like a cloud-init script (see footnote 9). Upon
creation, the provider returns a unique identifier for the node,
uuid, and an IP address at which the tenant can reach the node.

After  obtaining  the  node  uuid  and  IP  address,  the  tenant 
notifies  the  CV  of  their  intent  to  boot  a  cloud  node  (see  area 
A  in  Figure  5).  The  tenant  connects  to  the  CV  over  a  secure 
channel,  such  as  mutually  authenticated  TLS,  and  provides 
V, uuid, node IP, and a TPM policy. The TPM policy specifies
a whitelist of acceptable PCR values to expect from the
TPM  of  the  cloud  node.  At  this  point  the  CV  and  tenant 
can  begin  the  attestation  protocol  in  parallel. 

The attestation protocol of our scheme is shared between
the interactions of the CV and the cloud node (B) and that
of the tenant and the cloud node (C) with only minor differences
between them (Figure 5). The protocol consists
of two message volleys: the first for the initiator (either CV
or tenant) to request a TPM quote and the second for the
initiator to provide a share of $K_b$ to the cloud node upon
successful validation of the quote. Since we use this protocol
to bootstrap keys into the system, there are no existing
software keys with which we can create a secure channel.
Thus, this protocol must securely transmit a share of $K_b$
over an untrusted network. We accomplish this by creating
an ephemeral asymmetric key pair on the node, denoted
$NK$, outside of the TPM (see footnote 10). We use a TPM quote
to bind this key to the identity of the TPM. We do so by including a
hash of $NK_{pub}$ along with the nonce in the extra data field
in the quote. The initiator can then encrypt its share of $K_b$
using $NK_{pub}$ and securely return it to the cloud node.
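To illustrate the binding step, the sketch below computes the value placed in the quote's extra data field and checks it on the initiator side. It assumes SHA-1 over the plain concatenation of a fixed-length nonce and the serialized public key. The byte encoding of NK_pub, the nonce length, and the helper names are assumptions made for illustration, and verification of the TPM's signature over the quote is out of scope here.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 20  # assumed fixed nonce length in bytes


def external_data(nonce: bytes, nk_pub: bytes) -> bytes:
    """Value the node asks the TPM to sign: SHA1(nonce | NK_pub)."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("unexpected nonce length")
    return hashlib.sha1(nonce + nk_pub).digest()


def binding_is_valid(nonce: bytes, nk_pub: bytes, quoted: bytes) -> bool:
    """Initiator-side check that the quote covers its nonce and this NK_pub."""
    return hmac.compare_digest(external_data(nonce, nk_pub), quoted)


nonce = secrets.token_bytes(NONCE_LEN)
nk_pub = b"placeholder bytes standing in for a serialized public key"
quoted = external_data(nonce, nk_pub)   # as it would appear inside the quote

assert binding_is_valid(nonce, nk_pub, quoted)
assert not binding_is_valid(nonce, b"attacker key", quoted)
```

Fixing the nonce length keeps the concatenation unambiguous; a different framing would work equally well as long as both sides agree on it.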

The  differences  in  the  attestation  protocol  between  CV 
and  tenant  arise  in  how  each  validates  TPM  quotes.  Be¬ 
cause  we  wish  to  centralize  the  adjudication  of  integrity 
measurements  to  the  CV,  the  TPM  quote  that  the  tenant 
requests  only  verifies  the  identity  of  the  cloud  node’s  TPM 
and  doesn’t  include  any  PCR  hashes.  Since  the  tenant  gen¬ 
erates  a  fresh  Kb  for  each  cloud  node,  we  are  not  concerned 
with leaking U to a node with invalid integrity state. Furthermore,
because V is only one share of Kb, the CV cannot
be  subverted  to  decrypt  resources  without  user  intent. 

We now describe this protocol in detail. The initiator
first sends a fresh nonce ($n_t$ for the tenant and $n_{cv}$ for the
cloud verifier) to the cloud node along with a mask indicating
which PCRs the cloud node should include in its quote.
The CV sets the mask based on the TPM policy exchanged earlier
and the tenant creates an empty mask. The initiator
requests a quote from the TPM with the given policy. The
externalData of the quote is set to $SHA1(n | NK_{pub})$. The
node then returns $Quote_{AIK}(externalData, PCRs)$, $NK_{pub}$
to the initiator. During the protocol to provide $U$, the tenant
also supplies $HMAC_{K_b}(ID)$ to the node. This provides
the node with a quick check to determine if $K_b$ is correct.

9 Because $K_b$ may not be re-used in our protocol, the cost
of re-encrypting large disk images for each node may be
prohibitive. We advocate for creating small sensitive data
packets like a cloud-init script, and then establishing local
storage encryption with ephemeral keys.

10 $NK$ could also be generated and reside inside the TPM.
However, since it is ephemeral, is only used for transport
security, and the quote binds it to the TPM, we found the
added complexity of also storing it in the TPM unneeded.

Figure 5: Three Party Bootstrap Key Derivation Protocol.

The  initiator  then  confirms  that  the  AIK  is  valid  according 
to  the  tenant  registrar  over  server  authenticated  TLS.  If  the 
initiator  is  the  CV,  then  it  will  also  check  the  other  PCRs 
to  ensure  they  are  valid  according  to  the  tenant-specified 
whitelist.  If  the  node  is  virtual,  then  the  quote  to  the  CV 
will  also  include  a  deep  quote  of  the  underlying  hardware 
TPM.  The  CV  will  in  turn  validate  it  as  described  in  the 
previous section. Upon successful verification, the initiator
can then return their share of $K_b$. Thus, the tenant sends
$Enc_{NK}(U)$ and the cloud verifier sends $Enc_{NK}(V)$ to the
node. The cloud node can now recover $K_b$ and proceed with
the boot/startup process.
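The CV-side policy check can be pictured as follows. This sketch assumes a TPM policy represented as a mapping from PCR index to a set of acceptable digests, and a PCR mask encoded as a bit field with one bit per PCR. Both representations, and the names pcr_mask and pcrs_match_policy, are illustrative assumptions and not the keylime data model.

```python
from typing import Dict, Set

# Assumed layout: PCR index -> set of acceptable hex digests.
TpmPolicy = Dict[int, Set[str]]


def pcr_mask(policy: TpmPolicy) -> int:
    """Build a bit mask with one bit set for every PCR named in the policy."""
    mask = 0
    for index in policy:
        mask |= 1 << index
    return mask


def pcrs_match_policy(policy: TpmPolicy, quoted_pcrs: Dict[int, str]) -> bool:
    """True if every PCR in the policy was quoted with a whitelisted value."""
    return all(quoted_pcrs.get(index) in allowed
               for index, allowed in policy.items())


GOOD = "aa" * 20  # placeholder SHA-1 sized digest
policy = {0: {GOOD}, 15: {GOOD, "bb" * 20}}

cv_mask = pcr_mask(policy)    # bits 0 and 15 set
tenant_mask = pcr_mask({})    # the tenant requests no PCRs

assert cv_mask == (1 << 0) | (1 << 15)
assert tenant_mask == 0
assert pcrs_match_policy(policy, {0: GOOD, 15: "bb" * 20})
assert not pcrs_match_policy(policy, {0: "cc" * 20, 15: GOOD})
```

With an empty policy the check passes trivially, which mirrors the tenant's identity-only quote: the signature and AIK are still verified, but no PCR values are judged.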

The  cloud  node  does  not  retain  Kb  or  V  after  decryption 
of  d.  To  support  node  reboot  or  migration,  the  cloud  node 
stores  U  in  the  TPM  NVRAM  to  avoid  needing  the  tenant 
to  interact  again.  After  rebooting,  the  node  must  again 
request  verification  by  the  CV  to  obtain  V  and  re-derive 
Kb. If migration is allowed, the provider must take care to
also  securely  migrate  vTPM  state  to  avoid  losing  U. 

4.  IMPLEMENTATION 

We implemented keylime in approximately 5,000 lines of
Python in four components: registrar, node, CV, and tenant.
We use the IBM Software Trusted Platform Module
library  [18]  to  directly  interact  with  the  TPM  rather  than 
going  through  a  Trusted  Software  Stack  (TSS)  like  Trousers. 
The  registrar  presents  a  REST-based  web  service  for  en¬ 
rolling  node  AIKs.  It  also  supports  a  query  interface  for 
checking  the  keys  for  a  given  node  UUID.  The  node  com¬ 
ponent  runs  on  the  IaaS  machine,  VM,  or  container  and 
is  responsible  for  responding  to  requests  for  quotes  and  for 
accepting shares of the bootstrap key Kb. It provides an
unencrypted REST-based web service for these two functions.


To  support  vTPM  operations,  we  created  a  service  the 
IaaS  provider  runs  to  manage  hardware  TPM  activation  and 
vTPM  creation/association.  This  service  runs  in  a  desig¬ 
nated  Xen  domain  and  has  privileges  to  interact  with  the 
Xen  vtpmmgr  domain  [11].  We  then  implemented  a  utility 
for  the  deep  quote  operation.  Since  the  Xen  vTPM  imple¬ 
mentation  does  not  directly  return  the  PCR  values  from  the 
virtual  TPM  (i.e.,  the  shallow  quote)  during  a  deep  quote, 
we  chose  to  first  execute  a  shallow  quote,  hash  its  contents, 
and  place  them  in  the  external  data  field  of  the  deep  quote. 
This  operation  cryptographically  binds  them  together.  We 
then return both the shallow and deep quotes and require
that the verifier check both signatures and both sets of PCR values.

The  cloud  verifier  hosts  a  TLS-protected  REST-based  web 
service  for  control.  Tenants  add  and  remove  nodes  to  be 
verified  and  also  query  their  current  integrity  state.  Upon 
being  notified  of  a  new  node,  the  CV  enqueues  metadata 
about  the  node  onto  the  quote_request  queue  where  a  con¬ 
figurable  pool  of  worker  processes  will  then  request  a  deep 
quote  from  the  node.  Upon  successful  verification  of  the 
quote,  the  CV  will  use  an  HTTP  POST  to  send  V  to  the 
node. The CV uses PKCS#1 OAEP with RSA 2048
keys  to  protect  shares  of  Kb  in  transit. 
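The queue-and-worker structure can be approximated with the standard library. The sketch below substitutes a thread pool for the worker processes described above so that it stays self-contained, and fetch_and_check is a hypothetical callable standing in for the REST request, quote verification, and policy check.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def verify_round(node_ids: Iterable[str],
                 fetch_and_check: Callable[[str], bool],
                 workers: int = 4) -> Dict[str, bool]:
    """Check every node once and return node id -> verification result."""
    ids = list(node_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch_and_check, ids))
    return dict(zip(ids, results))


def fake_check(node_id: str) -> bool:
    # Stand-in: pretend nodes whose id ends in "-bad" fail attestation.
    return not node_id.endswith("-bad")


state = verify_round(["node-1", "node-2", "node-3-bad"], fake_check)
failed = [node for node, ok in state.items() if not ok]
assert failed == ["node-3-bad"]
```

Each check is independent of the others, so the pool size is the main knob for trading CPU and memory against the time needed to complete one pass over all nodes.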

The  tenant  generates  a  random  256-bit  AES  key  and  en¬ 
crypts  and  authenticates  the  bootstrap  data  using  AES  with 
Galois  Counter  Mode  [27].  The  tenant  uses  trivial  XOR- 
based  secret  sharing  to  split  Kb  into  V  and  U.  The  tenant 
executes  a  simplified  version  of  the  same  protocol  that  the 
CV  uses.  The  tenant  checks  with  the  registrar  to  determine 
if  the  quote  signing  AIK  is  valid  and  owned  by  the  tenant. 

Upon  receiving  U  and  V,  the  node  can  then  combine  them 
to  derive  Kb.  To  limit  the  impact  of  rogue  CVs  or  tenants 
connecting  to  the  node’s  unauthenticated  REST  interface, 
the  node  stores  all  received  U  and  V  values  and  iteratively 
tries  each  combination  to  find  the  correct  Kb.  Once  the 
node has correctly derived Kb, it mounts a small in-memory
file  system  using  tmpfs  and  writes  the  key  there  for  other 
applications  to  access. 
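One way to picture the combination search is shown below. It is a sketch under stated assumptions: each received U is stored with the HMAC tag that accompanied it, HMAC-SHA-256 is assumed as the MAC (the hash function is not specified above), and the function and variable names are hypothetical.

```python
import hashlib
import hmac
import itertools
import secrets
from typing import List, Optional, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def find_bootstrap_key(u_entries: List[Tuple[bytes, bytes]],
                       v_values: List[bytes],
                       node_id: bytes) -> Optional[bytes]:
    """Try each stored (U, tag) with each stored V. Return the first Kb whose
    HMAC over the node ID matches the tag, or None if no pair matches."""
    for (u, tag), v in itertools.product(u_entries, v_values):
        if len(u) != len(v):
            continue
        kb = xor_bytes(u, v)
        candidate = hmac.new(kb, node_id, hashlib.sha256).digest()
        if hmac.compare_digest(candidate, tag):
            return kb
    return None


node_id = b"hypothetical-node-uuid"
kb = secrets.token_bytes(32)
v = secrets.token_bytes(32)
u = xor_bytes(kb, v)
tag = hmac.new(kb, node_id, hashlib.sha256).digest()

# Genuine shares mixed with junk submitted by rogue parties.
u_entries = [(secrets.token_bytes(32), secrets.token_bytes(32)), (u, tag)]
v_values = [secrets.token_bytes(32), v]

assert find_bootstrap_key(u_entries, v_values, node_id) == kb
assert find_bootstrap_key(u_entries[:1], v_values, node_id) is None
```

The search cost grows with the product of the two lists, so a deployment following this approach would likely want to cap how many unauthenticated submissions the node retains.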

4.1  Integration 


While the key derivation protocol of keylime is generic
and can be used to decrypt arbitrary data, we believe the
most natural cloud use-case for it is to decrypt a small IaaS
node-specific package of data. To enable this use-case we
have integrated keylime with the cloud-init package, a
combination we call trusted-cloud-init. As described in
Section 2, cloud-init is a widely adopted mechanism to deploy
machine-specific data to IaaS resources. To integrate
keylime and cloud-init, we patched cloud-init to support
AES-GCM decryption of the user-data (where cloud-init
stores tenant scripts and data). We modified the upstart
system in Ubuntu Linux to start the keylime node
service before cloud-init. We then configure cloud-init
to find the key that keylime creates in the tmpfs mounted
file system. After successful decryption, cloud-init can
delete the key and scrub it from memory.

To support applications that need node identities but do
not manage their own PKIs, we implemented a simple software
CA. The tenant provisions the software CA by creating
the  CA  private  key  offline  and  delivering  it  to  a  new  node  us¬ 
ing  trusted-cloud-init.  We  also  deliver  certificates  to  the 
software  CA  that  allow  it  and  the  tenant  to  mutually  authen¬ 
ticate  each  other  via  trusted-cloud-init.  To  demonstrate 
the  clean  separation  between  the  trusted  computing  layer 
and  the  software  key  management  layer,  we  use  the  ZMQ 
Curve secure channel implementation [14]. This system uses
elliptic curve cryptography and other techniques dissimilar
from the cryptographic algorithms, keys, and other techniques
in use at the trusted computing layer.

To  enroll  a  new  node,  the  tenant  first  generates  a  node 
identity  key  pair  using  the  software  CA  client.  The  soft¬ 
ware  CA  supports  a  plugin  architecture  that  allows  the  ten¬ 
ant  to  specify  what  type  of  key  pairs  to  create  (e.g.,  X.509 
RSA  2048).  The  tenant  then  connects  securely  to  the  soft¬ 
ware  CA  over  ZMQ  and  gets  the  node’s  identity  certificate 
signed.  The  tenant  can  now  provision  a  new  node  with  this 
identity  using  trusted-cloud-init.  The  software  CA  also 
supports  receiving  notifications  from  the  CV  if  a  node  later 
fails  integrity  measurement. 

To  support  transparent  integration  with  an  IaaS  plat¬ 
form,  we  patched  OpenStack  Nova  and  libvirt  to  support 
the  creation  of  companion  vTPM  Xen  domains  for  each 
user  created  instance.  We  link  the  OpenStack  UUID  to  the 
keylime  provider  registrar.  We  then  implemented  a  wrap¬ 
per  for  the  OpenStack  Nova  command  line  interface  that 
enables  trusted-cloud-init.  Specifically,  our  wrapper  in¬ 
tercepts  calls  to  nova  boot  and  automatically  encrypts  the 
provided  user-data  before  passing  it  to  OpenStack.  It  then 
calls  the  keylime  tenant,  which  begins  the  bootstrapping 
protocol.  This  allows  OpenStack  users  to  transparently  use 
keylime  and  cloud-init  without  needing  to  fully  trust  the 
OpenStack provider not to tamper with or steal the sensitive
contents of their user-data.

4.2  Demonstration  Applications 

We  next  describe  how  keylime  can  securely  bootstrap 
and  handle  revocation  for  existing  non-trusted  computing- 
aware  applications  and  services  common  to  IaaS  cloud  de¬ 
ployments. 

IPsec  To  enable  secure  network  connectivity  similar  to 
TNC  [38],  we  implemented  trusted-cloud-init  scripts  to 
automatically  encrypt  all  network  traffic  between  a  tenant’s 
IaaS  resources.  The  scripts  use  the  OpenStack  API  for  IP 


address information and then build configurations for the
Linux IPsec stack and racoon (footnote 11).

Puppet To enable secure system configuration management,
we integrated keylime with Puppet. We do so by generating
the signed RSA keys that Puppet uses to communicate
with the Puppet master using the Software CA process
described  previously.  These  steps  bypass  the  need  to  either 
use  the  insecure  autosign  option  in  the  Puppet  master  to 
blindly  accept  new  nodes  or  to  have  an  operator  manually 
approve/deny  certificate  signing  requests  from  new  nodes. 
To  support  continuous  attestation  and  integrity  measure¬ 
ment,  we  implemented  a  plug-in  for  the  CV  that  notifies  the 
tenant’s  Puppet  master  when  a  node  fails  its  integrity  mea¬ 
surements.  The  master  can  then  revoke  that  node’s  access 
to  check-in  and  receive  the  latest  configuration  data. 

Vault  While  tools  like  Puppet  are  often  used  to  provi¬ 
sion  secrets  and  keys,  tenant  operators  can  instead  use  a 
dedicated  secret  management  system  that  supports  the  full 
lifecycle  of  cryptographic  keys  directly.  To  demonstrate  this, 
we  have  integrated  keylime  with  Vault,  a  cloud-compatible 
secret  manager.  Like  Puppet,  we  use  the  Software  CA  to 
provision  RSA  certificates  for  each  node  and  configure  Vault 
to  use  them.  We  also  implemented  a  revocation  plugin  for 
the  CV  that  notifies  Vault  to  both  revoke  access  to  a  node 
that  fails  integrity  measurement  and  to  re-generate  and  re¬ 
distribute  any  keys  to  which  that  node  had  access. 

LUKS  Finally,  to  demonstrate  our  ability  to  provision 
secrets  instead  of  cryptographic  identities,  we  implemented  a 
trusted-cloud-init script that provides the key to unlock
an  encrypted  volume  on  boot. 

5.  EVALUATION 

In  this  section  we  evaluate  the  overhead  and  scalability  of 
keylime  in  a  variety  of  scenarios.  We  ran  our  experiments 
on  a  private  OpenStack  cluster,  a  Xen  host,  and  a  Raspberry 
Pi.  In  OpenStack,  we  used  standard  instance  flavors  where 
the m1.small has 1 vCPU, 2GB RAM, and a 20GB disk,
and the m1.large has 4 vCPUs, 8GB RAM, and an 80GB disk.
We  used  Ubuntu  Linux  14.10  as  the  guest  OS  in  OpenStack 
instances. The Xen host had one Xeon E5-2640 CPU with
6 cores at 2.5GHz, a 10Gbit NIC, 64GB RAM, a WinBond
TPM, and ran Xen 4.5 on Ubuntu Linux 15.04. The Raspberry
Pi 2 had one ARMv7 with 4 cores at 900MHz, 1GB
RAM, a 100Mbit NIC, and ran Raspbian 7. We ran each of
the  following  experiments  for  1-2  minutes  and  present  aver¬ 
ages  of  the  performance  we  observed. 

5.1  TPM  Operations 

We  first  establish  a  baseline  for  the  performance  of  TPM 
operations  with  the  IBM  client  library,  our  Python  wrap¬ 
per  code,  the  Xen  virtual  TPM,  and  the  physical  TPM.  We 
benchmarked  both  TPM  quote  creation  and  verification  on 
the  Xen  host  (Table  1).  We  collected  the  physical  TPM  mea¬ 
surements  on  the  same  system  with  a  standard  (non-Xen) 
kernel.  We  collected  both  vTPM  quote  and  deep  quote  mea¬ 
surements  from  a  domain  running  on  Xen.  As  expected,  op¬ 
erations  that  require  interaction  with  the  physical  TPM  are 
slow.  Verification  times,  even  for  deep  quotes  that  include 
two  RSA  signature  verifications,  are  comparatively  quick. 

5.2  Key  Derivation  Protocol 

11 http://ipsec-tools.sourceforge.net/


Figure 6: Latency of keylime bootstrapping protocol.

Figure 7: Latency of TPM operations in bootstrapping protocol.

Figure 8: Maximum CV quote checking rate of keylime.


Table 1: Average TPM Operation Latency (ms).

| Operation | TPM | vTPM | Deep quote |
|---|---|---|---|
| Create Quote | 725 | 68.5 | 1395 |
| Check Quote | 4.64 | 4.64 | 5.33 |

We  next  investigate  the  CV  latency  of  different  phases 
of  our  protocol.  In  Figures  6  and  7,  we  show  the  aver¬ 
aged  results  of  hundreds  of  trials  of  the  CV  with  100  vTPM 
equipped  VMs.  Each  operation  includes  a  full  REST  in¬ 
teraction  along  with  the  relevant  TPM  and  cryptographic 
operations.  We  also  benchmarked  the  latency  of  the  pro¬ 
tocol  phases  emulating  zero  latency  from  the  TPM  (Null 
TPM).  This  demonstrates  the  minimum  latency  of  our  CV 
software  architecture  including  the  time  required  to  verify 
quotes.  The  results  from  the  Null  TPM  trials  indicate  that 
our  network  protocol  and  other  processing  impose  very  little 
additional  overhead,  even  on  the  relatively  modestly  pow¬ 
ered  Raspberry  Pi.  The  bare  metal  system  had  a  slightly 
larger  network  RTT  to  the  nodes  it  was  verifying  causing  it 
to have a higher latency than the less powerful m1.large.

In  Figure  7,  we  see  that  latency  for  the  quote  retrieval 
process  is  primarily  affected  by  slow  TPM  operations  and  is 
comparable  to  prior  work  [31].  The  bootstrapping  latency 
is  the  sum  of  the  latencies  for  retrieving  a  quote  and  provid¬ 
ing  V.  We  find  the  bootstrapping  latency  for  bare  metal  and 
virtual  machines  to  be  approximately  793ms  and  1555ms  re¬ 
spectively.  Virtual  nodes  doing  runtime  integrity  measure¬ 
ment  after  bootstrapping  benefit  from  much  lower  latency 
for  vTPM  operations.  Thus,  for  a  virtual  machine  with  a 
vTPM,  keylime  can  detect  integrity  violations  in  as  little  as 
110ms.  The  detection  latency  for  a  system  with  a  physical 
TPM  (781ms  for  our  Xen  host)  is  limited  by  the  speed  of 
the  physical  TPM  at  generating  quotes. 

5.3  Scalability  of  Cloud  Verifier 

Next  we  establish  the  maximum  rate  at  which  the  CV  can 
get  and  check  quotes  for  sustained  integrity  measurement. 
This  will  define  the  trade-off  between  the  number  of  nodes 
a  single  CV  can  handle  and  the  latency  between  when  an 
integrity  violation  occurs  and  the  CV  detects  it.  Recall  that 
the  CV  quote  checking  process  is  an  embarrassingly  parallel 
problem  where  the  CV  does  a  round  robin  check  of  each 
node. We emulate an arbitrarily large population of real
cloud nodes using a fixed number of test cloud nodes. These
test cloud nodes emulate a zero latency TPM by returning a
pre-constructed quote. This way the test nodes appear like
a larger population where the CV will never have to block
for a lengthy TPM operation to complete. We found that
around 500 zero latency nodes were sufficient to achieve the
maximum quote checking rate.

Figure 9: Scaling the CV on bare metal.

We  show  the  average  number  of  quotes  verified  per  second 
for  each  of  our  CV  deployment  options  in  Figure  8.  Because 
of  our  process-based  worker  pool  architecture,  the  primary 
factor  affecting  CV  scalability  is  the  number  of  cores  and 
RAM  available.  These  options  provide  a  variety  of  choices 
for  deploying  the  CV.  For  small  cloud  tenants  a  low  cost 
VM  or  inexpensive  hardware  appliance  can  easily  verify  hun¬ 
dreds  of  virtual  machines  with  moderate  detection  latency 
(5-10s).  For  larger  customers,  a  well-resourced  VM  or  dedi¬ 
cated  hardware  can  scale  to  thousands  with  similar  latency. 
For  high  security  environments,  all  options  can  provide  sub¬ 
second  detection  and  response  time.  In  future  work,  we  plan 
to  implement  a  priority  scheme  that  allows  the  tenant  to  set 
the  rate  of  verifications  for  different  classes  of  nodes. 

We  next  show  how  our  CV  architecture  can  scale  by  adding 
more  cores  and  parallelism.  We  use  the  bare  metal  CV  and 
show  the  average  rate  of  quotes  retrieved  and  checked  per 
second  for  500  test  nodes  in  Figure  9.  We  see  linear  speed¬ 
up  until  we  exhaust  the  parallelism  of  the  host  CPU  and 
the  concurrent  delay  of  waiting  for  many  cloud  nodes.  This 
represents a modest performance improvement
over  Schiffman  et  al.’s  CV  which  was  able  to  process  approx¬ 
imately  2,000  quotes  per  second  on  unspecified  hardware  [34] 
and a substantial improvement over the Excalibur monitor’s
ability to check approximately 633 quotes per second [31].

Table 2: On-Premises bare metal CV verifying 250
Cloud Nodes using 50 CV processes.

| Network RTT (ms) | Rate (quotes/s) | Bandwidth (Kbits/s) |
|---|---|---|
| 4 (native) | 937 | 3084.6 |
| 25 | 613 | 2017.0 |
| 50 | 398 | 1310.2 |
| 75 | 282 | 928.3 |
| 100 | 208 | 684.7 |
| 150 | 141 | 464.2 |

5.4  On-Premises  Cloud  Verifier 

Finally,  we  investigate  the  performance  of  the  CV  when 
hosted  at  the  tenant  site  away  from  the  cloud.  We  show 
the  results  of  our  bare  metal  system’s  quote  verification 
rate and the bandwidth used for a variety of network delays
we inserted using the comcast tool (footnote 12) in Table 2. These
results show that it is possible to run the CV on-premises at
the tenant site at the cost of a reduction in quote checking
rate (and therefore an increase in detection latency) and several
Mbit/s of bandwidth. As such, we recommend running the CV
in the cloud alongside the nodes it will verify as the
highest-performance and lowest-cost option.
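The figures in Table 2 can be turned into two back-of-the-envelope quantities: the time for one round-robin pass over the 250 nodes, and the bandwidth consumed per quote. The calculation below is simple arithmetic on the table's numbers rather than an additional measurement, and treating one full pass as an estimate of detection latency is a simplifying assumption.

```python
# Rows copied from Table 2: (network RTT in ms, quotes/s, Kbits/s).
TABLE_2 = [
    (4, 937, 3084.6),
    (25, 613, 2017.0),
    (50, 398, 1310.2),
    (75, 282, 928.3),
    (100, 208, 684.7),
    (150, 141, 464.2),
]
NODES = 250  # cloud nodes verified in the Table 2 experiment


def sweep_seconds(nodes: int, quotes_per_second: float) -> float:
    """Time for the CV to check every node once at the given rate."""
    return nodes / quotes_per_second


def kbits_per_quote(kbits_per_second: float, quotes_per_second: float) -> float:
    """Average network cost of retrieving and checking a single quote."""
    return kbits_per_second / quotes_per_second


for rtt, rate, bandwidth in TABLE_2:
    print(f"{rtt:>4} ms RTT: one pass in {sweep_seconds(NODES, rate):.2f} s, "
          f"{kbits_per_quote(bandwidth, rate):.2f} Kbit per quote")
```

Every row works out to roughly 3.3 Kbit per quote, so bandwidth scales directly with the checking rate, while a full pass grows from about 0.27 s at the native 4 ms RTT to about 1.77 s at 150 ms.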

6.  RELATED  WORK 

Many  of  the  challenges  that  exist  in  traditional  enterprise 
networks  exist  in  cloud  computing  environments  as  well. 
However,  there  are  new  challenges  and  threats  including 
shared  tenancy  and  additional  reliance  on  the  cloud  provider 
to  provide  a  secure  foundation  [4,  16].  To  address  some  of 
the  challenges,  many  have  proposed  trusted  computing. 

The  existing  specifications  for  trusted  computing  rely  on 
trusted  hardware,  and  assume  a  single  owner  of  the  system. 
With  the  advent  of  cloud  computing,  this  assumption  is  no 
longer valid. While both the standards community [40] and
prior work [2] are beginning the process of supporting virtualization,
no end-to-end solution exists. For example, the
cTPM  system  [5]  assumes  a  trustworthy  cloud  provider  and 
requires  modifications  to  trusted  computing  standards.  An¬ 
other  proposal  for  higher-level  validation  of  services  provides 
a  cryptographically  signed  audit  trail  that  the  hypervisor 
provides  to  auditors  [12].  The  audit  trail  captures  the  exe¬ 
cution  of  the  applications  within  the  virtual  machine.  This 
proposal  does  not  provide  a  trusted  foundation  for  the  audit 
trail, and assumes a benign hypervisor. Bleikertz et al. propose
to use trusted computing to provide cryptographic services
for cloud tenants [3]. Their Cryptography-as-a-Service
(CaaS)  system  relies  on  trusted  computing,  but  does  not  ad¬ 
dress  bootstrapping  and  requires  hypervisor  modifications 
that  cloud  providers  are  unlikely  to  support. 

To  address  the  issues  of  scalability,  several  proposals  ex¬ 
ist  to  monitor  a  cloud  infrastructure,  and  allow  for  valida¬ 
tion  of  the  virtual  machines  controlled  by  the  tenants  of  the 
cloud  [34,  32,  33,  35].  The  cloud  verifier  pattern  proposed 
by Schiffman et al. allows a single verifier to validate trust
in  the  cloud  infrastructure,  and  in  turn  the  cloud  verifier 
“vouches”  for  the  integrity  of  the  cloud  nodes.  This  removes 

12 https://github.com/tylertreat/Comcast


the  need  for  tenants  to  validate  the  integrity  of  the  hyper¬ 
visor  hosts  prior  to  instantiating  cloud  nodes  on  them  and 
avoids  the  need  for  nodes  to  mutually  attest  each  other  be¬ 
fore  communicating.  The  tenant  simply  provides  their  in¬ 
tegrity  verification  criteria  to  the  verifier,  and  the  verifier  en¬ 
sures  that  the  tenant’s  integrity  criteria  are  satisfied  as  part 
of  scheduling  resources.  We  utilize  the  cloud  verifier  pat¬ 
tern  in  our  work,  with  some  important  differences.  First  we 
extend  it  to  support  secure  system  bootstrapping  for  both 
bare  metal  and  virtualized  IaaS  environments.  Second,  we 
do  not  host  any  tenant-owned  parts  of  the  integrity  mea¬ 
surement  infrastructure  in  the  provider’s  control  as  they  do. 
This  means  that  our  solution  is  substantially  less  invasive 
to  the  cloud  provider’s  infrastructure  (e.g.,  they  required 
nearly  5,000  lines  of  code  to  be  added  to  OpenStack)  and 
is  less  prone  to  compromise.  For  example,  keylime  relies 
upon  the  vTPM  integrity  measurements  inside  tenant  nodes 
rather  than  enabling  the  cloud  provider  to  have  explicit  vir¬ 
tual  machine  introspection  (i.e.,  secret  stealing)  capabilities. 

Excalibur aims to address the scalability problems of trusted
computing by leveraging ciphertext policy attribute-based
encryption (CPABE) [31]. This encryption scheme allows
data to be encrypted using keys that represent attributes of
the hypervisor hosts in the IaaS environment (e.g., software
version, country, zone). Using Excalibur, clients can encrypt
sensitive data and be assured that a hypervisor will only
be given access to the data if the policy (the specified set
of attributes) is satisfied. Excalibur only addresses trusted
bootstrapping for the underlying cloud platform. Therefore,
a compromised tenant node would be neither detected nor
prevented. The Excalibur monitor is a provider-owned (but
attested) component that holds the encryption keys that allow
a node to boot on a particular hypervisor. keylime uses
secret sharing to avoid having bootstrap keys stored (and
therefore vulnerable) in any cloud system except the
cloud node for which they are intended.
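
To make the secret sharing idea concrete, the following is a minimal sketch of a two-share XOR split, in which neither share alone reveals anything about the key. This section only states that secret sharing is used, so the two-share XOR construction and the names `split_key` and `combine_shares` are assumptions chosen for illustration.

```python
import secrets


def split_key(key: bytes) -> tuple:
    """Split key into two shares (u, v) such that u XOR v == key.

    v is uniformly random, so each share on its own is statistically
    independent of the key.
    """
    v = secrets.token_bytes(len(key))
    u = bytes(a ^ b for a, b in zip(key, v))
    return u, v


def combine_shares(u: bytes, v: bytes) -> bytes:
    """Recombine two equal-length shares into the original key."""
    if len(u) != len(v):
        raise ValueError("shares must have the same length")
    return bytes(a ^ b for a, b in zip(u, v))
```

With such a split, the two shares can travel by different paths and be recombined only on the intended node, so no intermediate system ever holds the complete key.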

The CloudProxy Tao system provides building blocks to
establish trusted services in a layered cloud environment [25].
The Tao environment relies on the TPM to establish identity
and load-time integrity of the nodes and software in
the system. Tao does not support system integrity
monitoring, because it assumes that all interactions will only be
with other trusted programs running in the Tao environment.
Tao relies upon mutual attestation for all communicating
nodes, but is unable to use TPM-based keys because
they are not fast enough to support mutual attestation. Using
the out-of-band CV, we avoid mutual attestation while
maintaining rapid detection of integrity violations. The Key
Server in Tao holds all the secret keys to the system, must
interact with hosts to load new applications, and must be
fully trusted. The Key Server does not offer compatible deployment
options for IaaS environments, especially for small
tenants who cannot afford secure facilities or hardware security
modules. Furthermore, CloudProxy Tao does not detail
the secure bootstrapping of its Key Server or Privacy
CA component for TPM initialization. keylime explicitly
describes bootstrapping of all relevant components and enables
multiple realistic secure deployment options for CV
and registrar hosting.

7.  CONCLUSION 

In this paper, we have shown that keylime provides a
fully integrated solution to bootstrap and maintain hardware-rooted
trust in elastically provisioned IaaS clouds. We have
demonstrated that we can securely bootstrap hardware-rooted
cryptographic identities into cloud nodes, and leverage those
identities in higher-level security services, without requiring
each service to become trusted computing-aware. keylime also
uses a novel key derivation protocol that incorporates a tenant’s
intent to provision new cloud resources with integrity
measurement. Finally, we have demonstrated and evaluated
several deployment scenarios for our system’s critical component,
the cloud verifier. Overall, keylime introduces less
than two seconds of overhead during the provisioning and
bootstrapping process, and requires as little as 110ms to respond
to an integrity violation. Furthermore, we have shown
that keylime can scale to support thousands of IaaS nodes
while maintaining quick response to integrity violations.